New Perspectives in Sinographic Language Processing through the Use of Character Structure
Author
Abstract
Chinese characters have a complex, hierarchical graphical structure that carries both semantic and phonetic information. We use this structure to enrich the text model and obtain better results in standard NLP operations. First, to tackle the problem of graphical variation, we define allographic classes of characters. Next, the relation of inclusion of a subcharacter in a character provides us with a directed graph of allographic classes. We equip this graph with two weights, semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate the “most semantic subcharacter paths” for each character. Finally, by adding the information contained in these paths to unigrams, we claim to increase the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) totalling 18 million characters and obtain an improvement of 3% over an already high baseline of 89.6% precision, obtained with a linear SVM classifier. Other possible applications and perspectives of the system are discussed.
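The idea of a "most semantic subcharacter path" can be illustrated with a minimal sketch. The abstract does not define how a path is scored, so the code below assumes that a path's score is the product of the semanticity weights along it and that the best path runs from a character down to a subcharacter with no further components. The graph `SUBCHARS`, its weights, and the function names `most_semantic_path` and `enrich_unigrams` are all hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Hypothetical toy graph: character -> list of (subcharacter, semanticity).
# The weights are invented for illustration; they are not from the paper.
SUBCHARS = {
    "語": [("言", 0.9), ("吾", 0.2)],
    "吾": [("五", 0.1), ("口", 0.6)],
    "言": [],
    "五": [],
    "口": [],
}


def most_semantic_path(char, graph):
    """Return (score, path) for the path from `char` down to a leaf that
    maximises the product of semanticity weights (an assumed scoring rule)."""

    @lru_cache(maxsize=None)
    def best(node):
        children = graph.get(node, [])
        if not children:
            # A leaf (or a character unknown to the graph) ends the path.
            return 1.0, (node,)
        candidates = []
        for child, weight in children:
            score, path = best(child)
            candidates.append((weight * score, (node,) + path))
        return max(candidates, key=lambda c: c[0])

    return best(char)


def enrich_unigrams(text, graph):
    """Emit each character as a unigram, followed by one extra feature per
    subcharacter on its most semantic path (assumed feature scheme)."""
    features = []
    for ch in text:
        features.append(ch)
        _, path = most_semantic_path(ch, graph)
        features.extend("sub:" + sub for sub in path[1:])
    return features
```

Under these assumptions, `most_semantic_path("語", SUBCHARS)` selects the path through 言 rather than through 吾, because 0.9 is larger than 0.2 × 0.6. The enriched feature list could then be passed to any bag-of-features classifier, such as the linear SVM mentioned in the abstract.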